BMJ Health & Care Informatics
Preprints posted in the last 30 days, ranked by how well they match BMJ Health & Care Informatics's content profile, based on 13 papers previously published here. The average preprint has a 0.10% match score for this journal, so anything above that is already an above-average fit.
Adekunle, T.; Ohaeche, J.; Adekunle, T.; Adekunle, D.; Kogbe, M.
Background: Artificial intelligence is increasingly embedded in healthcare delivery. Its legitimacy depends on institutional governance, not technical performance alone. Prior research has centered on clinicians and patients. Less attention has been given to cybersecurity professionals who sustain the digital infrastructures that support health AI. This study examines how cybersecurity professionals conceptualize AI as clinical infrastructure and how these interpretations shape understandings of trust, risk, and oversight. Methods: Guided by sociotechnical systems theory and institutional trust scholarship, we conducted semi-structured in-depth interviews with twenty cybersecurity professionals working in healthcare-relevant domains. Participants were recruited through professional networks and LinkedIn outreach. Interviews were conducted between May and August 2025. They were audio-recorded and transcribed verbatim. Data were analyzed using qualitative content analysis with constant comparison. Two researchers independently coded transcripts and refined themes through iterative discussion. The study received Institutional Review Board approval. Results: Participants described health AI as an augmented clinical infrastructure. They emphasized that AI extends workflow capacity but requires sustained human oversight. Healthcare data systems were characterized as fragmented and vulnerable. Breaches were treated as anticipated events. Trust in AI was described as contingent and built over time through visible accountability. Cybersecurity stewardship was framed as foundational to institutional trustworthiness. Conclusions: Health AI credibility emerges through governance practices that demonstrate accountability; cybersecurity professionals and institutional stakeholders jointly shape trust in digitally mediated healthcare systems through such governance decisions.
Al-Dabbas, Z.; Khandakji, L.; Al-Shatarat, N.; Alqaisiah, H.; Ibrahim, Y.; Awed, T.; Baik, H.; Dawoud, M.; Ali, R. A.-H.; Telfah, Z.; Al-Hmaid, Y.; Alsharkawi, A.
Artificial intelligence (AI) is increasingly integrated into healthcare delivery, yet patient acceptance in resource-constrained settings remains incompletely characterized. This study assessed attitudes toward AI-supported care among patients attending hospitals in three Jordanian governorates (Amman, Balqa, Irbid) and examined demographic and digital-literacy correlates of acceptance. In a cross-sectional survey (n = 500 complete questionnaires), participants rated exposure to AI in healthcare and five attitudinal domains, namely perceived usefulness or performance expectancy, trust and transparency, privacy and perceived risks, empathy and human interaction, and readiness or behavioral intention, using 25 items on 5-point Likert scales. Patients expressed conditional optimism: empathy and human interaction was most strongly endorsed (M = 4.33, SD = 0.58), alongside relatively high perceived usefulness (M = 3.97, SD = 0.68), while trust and transparency (M = 3.57, SD = 0.74) and readiness (M = 3.66, SD = 0.90) were moderate to high; privacy and risk concerns were moderate (M = 3.51, SD = 0.77) and self-reported exposure was lowest (M = 2.57, SD = 1.07). The highest-agreement item indicated preference for AI to work alongside physicians rather than be relied on alone (M = 4.47, SD = 0.81). Trust and transparency and perceived usefulness were positively associated with readiness (r = 0.48 and r = 0.44, respectively; p < .001), while privacy and perceived risks were negatively correlated with trust and usefulness. In multivariable regression adjusting for gender, age group, education, prior AI health app or device use, and self-rated digital skill, lower educational attainment (less than high school and high school) predicted reduced readiness, whereas higher digital skill predicted increased readiness (R2 = 0.101). These findings suggest that implementation strategies in Jordan should emphasize human involvement alongside AI, transparent communication and governance, and interventions that build digital confidence and reduce readiness gaps linked to education. Author summary: AI is increasingly used in healthcare, for example to support diagnosis, triage, and treatment decisions. Whether these tools are accepted by patients depends not only on how well they work, but also on whether patients trust them, understand how they are used, and feel their privacy is protected. Evidence on patient views in middle-income and resource-constrained settings is still limited. We surveyed 500 patients attending hospitals in three Jordanian governorates to understand how they view AI-supported care. Patients generally expected AI to be useful, but they strongly preferred that clinicians remain actively involved and that AI supports rather than replaces physicians. Trust and perceived usefulness were closely linked to willingness to accept AI-enabled care, while privacy concerns were present and shaped trust. Readiness to accept AI was lower among participants with lower educational attainment and higher among those with greater self-rated digital skill. These findings suggest that successful implementation in Jordan should prioritize transparent communication, strong privacy safeguards, and human-centered workflows, while also strengthening digital confidence to avoid widening gaps in acceptance.
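A minimal sketch of the two reported analyses (bivariate Pearson correlations with readiness, then a multivariable OLS model), assuming a hypothetical per-respondent table of domain scores; the file and column names are illustrative, not from the study:

```python
# Sketch only: "survey_scores.csv" and its columns are hypothetical stand-ins
# for the study's per-patient domain means.
import pandas as pd
import statsmodels.formula.api as smf
from scipy.stats import pearsonr

df = pd.read_csv("survey_scores.csv")

# Bivariate associations with readiness (cf. r = 0.48 and r = 0.44 in the abstract)
for domain in ["trust_transparency", "perceived_usefulness", "privacy_risk"]:
    r, p = pearsonr(df[domain], df["readiness"])
    print(f"{domain}: r = {r:.2f}, p = {p:.3g}")

# Multivariable model adjusting for demographics, prior AI use, and digital skill
model = smf.ols(
    "readiness ~ C(gender) + C(age_group) + C(education) + prior_ai_use + digital_skill",
    data=df,
).fit()
print(model.summary())
print(f"R^2 = {model.rsquared:.3f}")  # the abstract reports R2 = 0.101
```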
Ng, J. Y.; Bhavsar, D.; Krishnamurthy, M.; Dhanvanthry, N.; Fry, D.; Kim, J. W.; King, A.; Lai, J.; Makwanda, A.; Olugbemiro, P.; Patel, J.; Virani, I.; Ying, E.; Yong, K.; Zaidi, A.; Zouhair, J.; Lee, M. S.; Lee, Y.-S.; Nesari, T. M.; Ostermann, T.; Witt, C. M.; Zhong, L.; Cramer, H.
Background: Artificial intelligence chatbots (AICs) are increasingly being integrated into scholarly publishing, with the potential to automate routine editorial tasks and streamline workflows. In traditional, complementary, and integrative medicine (TCIM) publishing, editorial and peer review processes can be particularly complex due to diverse methodologies and culturally embedded knowledge systems, presenting unique opportunities and challenges for AIC adoption. Methods: An anonymous, online cross-sectional survey was distributed to the editorial board members of 115 TCIM journals. The survey assessed familiarity and current use of AICs, perceived benefits and challenges, ethical concerns, and anticipated future roles in editorial workflows. Results: Of 5,119 invitations, 217 eligible participants completed the survey. While approximately 70% of respondents reported familiarity with AI tools, over 60% had never used AICs for editorial tasks. Editors expressed strongest support for text-focused applications, such as grammar and language checks (81.0%) and plagiarism/ethical screening (67.4%). Most respondents (82.8%) believed that AICs would be important or very important to the future of scholarly publishing; however, the majority (65.3%) reported that their journals lacked AI-specific policies and training programs to guide editors and peer reviewers. Conclusions: Most TCIM editors believe that AICs have the potential to support routine editorial functions, yet adoption into editorial and peer review processes remains limited due to practical, ethical, and institutional barriers. If AICs are to be adopted in TCIM academic publishing, journals should provide additional training and guidance to direct their responsible and ethical use.
Bladder, K. J. M.; Verburg, A. C.; Arts-Tenhagen, M.; Willemsen, R.; van den Broek, G. B.; Driessen, C. M. L.; Driessen, R. J. B.; Robberts, B.; Scheffer, A. R. T.; de Vries, A. P.; Frenzel, T.; Swillens, J. E. M.
Background: Generative artificial intelligence (GenAI) in healthcare may reduce administrative burden and enhance quality of care. Large language models (LLMs) can generate draft responses to patient messages using electronic health record (EHR) data. This could mitigate increased workload related to high message volumes. While the effectiveness and feasibility of these GenAI tools have been studied in the United States, evidence from non-English contexts is scarce, particularly regarding user experience. Objective: This study evaluated the effectiveness, feasibility, and barriers and facilitators of implementing Epic's Augmented Response Technology (Art) GenAI tool (Epic Systems Corporation, Verona, WI, USA) in a Dutch academic healthcare setting among a broad range of end users. It explored healthcare professionals' (HCPs') usage metrics, expectations, and early user experiences. Methods: We conducted a hybrid type 1 effectiveness-implementation design. HCPs of four clinical departments (dermatology, medical oncology, otorhinolaryngology, and pulmonology) participated in a six-month study. Effectiveness of Art was assessed using efficiency indicators from Epic (including all InBasket users in the hospital) and survey scales measuring well-being and clinical efficiency at three time points: PRE, POST-1 (1 month), and POST-2 (4 months). Feasibility of Art was evaluated through adoption indicators from Epic and survey scales on use and usability. Barriers and facilitators of Art implementation were collected through the survey and thematized using the NASSS framework (Nonadoption, Abandonment, Scale-up, Spread and Sustainability). Results: 237 unique HCPs generated a total of 8,410 drafts. Review and drafting times were similar for users with and without Art, indicating minimal differences. Perceived clinical efficiency declined significantly from PRE to POST-2, while well-being remained unchanged. Adoption was initially high but decreased over time, averaging 16.7% across departments. Usability and intention-to-use scores also declined significantly. Qualitative findings highlighted time savings, well-structured drafts, and patient-centered language as facilitators. Reported barriers included limited impact on time, low practical utility, content inaccuracies, and style misalignment. Conclusions: This evaluation of a GenAI tool for patient-provider communication in a non-English academic hospital revealed mixed perceptions of effectiveness and feasibility. High initial expectations contrasted with limited perceived impact on time savings, well-being, and clinical efficiency, alongside declining adoption and usability. Barriers and facilitators revealed contrasting views. These findings underscore the need for a workflow for handling user feedback and guidance on clinical responsibilities, along with clear communication about the tool's purpose and limitations to manage expectations. Additionally, establishing consensus on a set of quality indicators, and the thresholds that indicate when a GenAI tool is sufficiently robust, will be critical for responsible scaling of GenAI in clinical practice.
Yip, A.; Craig, G.; White, N. M.; Cortes-Ramirez, J.; Shaw, K.; Reddy, S.
Purpose: To evaluate whether large language models (LLMs) can enhance clinician-patient communication by simplifying radiology reports to improve patient readability and comprehension. Methods: A randomised controlled trial was conducted at a single healthcare service for patients undergoing X-ray, ultrasound or computed tomography between May 2025 and June 2025. Participants were randomised in a 1:1 ratio to receive either (1) the formal radiology report only or (2) the formal radiology report and an LLM-simplified version. Readability scores, including the Simple Measure of Gobbledygook, Automated Readability Index, Flesch Reading Ease, and Flesch-Kincaid grade level, were calculated for both reports. Statistical analysis of patient readability and comprehension levels, factual accuracy and hallucination rates for LLMs was assessed using a combination of binary and 5-point Likert scales, open-ended survey questions, and independent review by two radiologists. Results: 59/120 patients were randomised to receive both the formal and LLM-simplified radiology reports. Readability of LLM-simplified reports significantly improved, with the reading level required for formal reports equivalent to a university standard (11th-13th grade) compared to a middle-school standard (5th-9th grade) for simplified reports (rank biserial correlation = 0.83, p < 0.001). Patients with both reports demonstrated a significantly greater comprehension level, with 95% reporting an understanding level greater than 50%, compared with 46% without the simplified report (rank biserial correlation = 0.67, p < 0.001). All LLM-simplified reports were considered at least somewhat accurate, with a minimal hallucination rate of 1.7%. Importantly, no hallucinations resulted in potential patient harm. 118/120 (98.3%) patients expressed interest in simplified radiology reports being included in future clinical practice. Conclusion: This study provides evidence that LLMs can simplify radiology reports to an accessible level of readability with minimal hallucination. LLMs improve both ease of readability and comprehension of radiology reports for patients. Therefore, the rapid advancement of LLMs shows strong potential in enhancing patient-radiologist communication as patient access to electronic health records is increasingly adopted. Highlights: (1) Radiology reports can be complex and difficult for patients to read and interpret. (2) Strong patient demand exists for simplified radiology reports. (3) Large language models (LLMs) such as GPT-4o show promise in simplifying radiology reports. (4) LLMs credibly simplify radiology reports with minimal hallucination rates. (5) LLMs improve both patient readability and comprehension of radiology reports.
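The four readability indices named in the Methods can all be computed with the open-source textstat package; a small sketch with placeholder report text (the real study scored full radiology reports):

```python
# Sketch only: the two report strings are placeholders, not study data.
import textstat

formal = ("The cardiomediastinal silhouette is within normal limits. "
          "No focal consolidation, pleural effusion, or pneumothorax is identified.")
simplified = ("Your heart and the area around it look normal. "
              "We did not see any infection, fluid, or collapsed lung.")

for label, text in [("formal", formal), ("simplified", simplified)]:
    print(label,
          "| SMOG:", textstat.smog_index(text),
          "| ARI:", textstat.automated_readability_index(text),
          "| Flesch ease:", textstat.flesch_reading_ease(text),
          "| FK grade:", textstat.flesch_kincaid_grade(text))
```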
Park, J.-H.; Kim, S.-Y.
Background: South Korea's healthcare system, while technologically advanced, faces persistent inefficiencies in health information exchange (HIE) across its fragmented hospital network. Blockchain technology has been proposed as a decentralised infrastructure for secure, interoperable health data sharing, yet empirical evidence quantifying the efficiency gains attributable to blockchain-based HIE systems at the hospital network level remains absent. Traditional performance metrics fail to distinguish between technical inefficiency (suboptimal use of existing resources) and frontier shifts (technological improvements enabling new performance levels). Objective: To estimate the technical efficiency of health information exchange across South Korean hospital networks and to quantify the efficiency differential attributable to blockchain-enabled versus conventional HIE platforms using Stochastic Frontier Analysis (SFA) with Bayesian Model Averaging (BMA). Methods: We conducted a panel study of 247 hospital networks (comprising 1,842 individual hospitals) across all 17 South Korean provinces and metropolitan cities over 16 quarters (Q1 2021-Q4 2024). The dataset was constructed from the Health Insurance Review and Assessment Service (HIRA) claims database, the Korean Health Information Exchange registry, and hospital-reported digital infrastructure surveys. We specified a translog stochastic production frontier with time-varying inefficiency, where HIE output (a composite index of data completeness, exchange volume, interoperability score, and clinical decision support utilisation) was modelled as a function of inputs including IT staff, digital infrastructure investment, electronic health record maturity, and network connectivity. Bayesian Model Averaging over 2^18 candidate specifications addressed model uncertainty in covariate selection for the inefficiency determinants equation. The blockchain treatment effect was estimated using a control function approach with Mundlak-Chamberlain correlated random effects to address endogenous adoption timing. Secondary analyses examined whether blockchain's efficiency impact varied by hospital network characteristics using a latent class stochastic frontier model. Results: The mean technical efficiency of HIE across all networks was 0.67 (SD = 0.14), indicating that the average network achieved only 67% of its frontier potential output given its input levels. Blockchain-adopting networks (n = 83, 33.6%) demonstrated significantly higher mean technical efficiency (0.78, 95% CrI: 0.75-0.81) compared to conventional HIE networks (0.61, 95% CrI: 0.58-0.64), yielding an efficiency differential of 0.17 (95% CrI: 0.13-0.21). BMA identified blockchain adoption (posterior inclusion probability [PIP]: 0.98), network size (PIP: 0.94), EHR maturity level (PIP: 0.91), and dedicated IT governance structures (PIP: 0.87) as the most robust determinants of technical efficiency. The control function approach confirmed that the blockchain effect was robust to endogeneity concerns (Hausman test p = 0.34). Latent class analysis identified three distinct efficiency regimes: "Digital Leaders" (24.3% of networks, mean efficiency: 0.84), "Transitioning Networks" (48.6%, mean efficiency: 0.68), and "Digital Laggards" (27.1%, mean efficiency: 0.49). Blockchain adoption shifted the probability of Digital Leader classification by 31.2 percentage points (95% CrI: 24.8-37.6).
Conclusions: South Korean hospital networks operating with blockchain-enabled HIE infrastructure achieve substantially higher technical efficiency in health data exchange, with the efficiency advantage persisting after accounting for model uncertainty and endogenous adoption. These findings provide the first large-scale econometric evidence supporting blockchain's operational value in healthcare information infrastructure and have direct implications for South Korea's ongoing national digital health strategy and for international health systems considering blockchain-based interoperability solutions.
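For readers unfamiliar with stochastic frontier analysis, the sketch below fits a basic half-normal production frontier (Aigner-Lovell-Schmidt) by maximum likelihood and recovers Jondrow-style technical efficiency scores. It is a simplified Cobb-Douglas stand-in for the paper's translog/BMA specification, run on simulated data rather than the HIRA panel:

```python
# Simplified SFA sketch with simulated data; not the paper's model or data.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
n = 500
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # intercept + 2 log-inputs
beta_true = np.array([1.0, 0.4, 0.3])
v = rng.normal(0, 0.2, n)           # symmetric noise
u = np.abs(rng.normal(0, 0.3, n))   # one-sided inefficiency
y = X @ beta_true + v - u

def neg_loglik(theta):
    beta, sv, su = theta[:3], np.exp(theta[3]), np.exp(theta[4])
    sigma = np.hypot(sv, su)
    lam = su / sv
    eps = y - X @ beta
    ll = (np.log(2) - np.log(sigma) + norm.logpdf(eps / sigma)
          + norm.logcdf(-eps * lam / sigma))
    return -ll.sum()

res = minimize(neg_loglik, x0=np.zeros(5), method="BFGS")
beta_hat, sv, su = res.x[:3], np.exp(res.x[3]), np.exp(res.x[4])

# Jondrow et al. (1982) point estimate of inefficiency, then TE = exp(-E[u|eps])
eps = y - X @ beta_hat
sigma2 = sv**2 + su**2
mu_star = -eps * su**2 / sigma2
s_star = sv * su / np.sqrt(sigma2)
E_u = mu_star + s_star * norm.pdf(mu_star / s_star) / norm.cdf(mu_star / s_star)
te = np.exp(-E_u)
print("mean technical efficiency:", te.mean())  # the paper reports 0.67 on real data
```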
Edara, R.; Khare, A.; Atreja, A.; Awasthi, R.; Highum, B.; Hakimzadeh, N.; Ramachandran, S. P.; Mishra, S.; Mahapatra, D.; Shree, S.; Bhattacharyya, A.; Singh, N.; Reddy, S.; Cywinski, J. B.; Khanna, A. K.; Maheshwari, K.; Papay, F. A.; Mathur, P.
Background: Breakthroughs in model architecture and the availability of data are driving transformational artificial intelligence in healthcare research at an exponential rate. The shift in use of model types can be attributed to the multimodal properties of Foundation Models, better reflecting the inherently diverse nature of clinical data, and to advancing model implementation capabilities. Overall, the field is maturing from exploratory development towards application in real-world evaluation and implementation, spanning both generative and predictive AI. Methods: A database search in PubMed was performed using the terms "machine learning" or "artificial intelligence" and "2025", with the search restricted to English-language human-subject research. A BERT-based deep learning classifier, pre-trained and validated on manually labeled data, assessed publication maturity. Five reviewers then manually annotated publications for healthcare specialty, data type, and model type. Systematic reviews, duplicates, pre-prints, robotic surgery studies, and non-human research publications were excluded. Publications employing foundation models were further analyzed for their areas of application and use cases. Results: The PubMed search yielded 49,394 publications, a near-doubling from 28,180 in 2024, of which 3,366 were classified as mature. 2,966 were included in the final analysis after exclusions, compared to 1,946 in 2024. Imaging remained the dominant specialty (976 publications), followed by Administrative (277) and General (251). Traditional text-based LLMs (1,019) led model usage, but Multimodal Foundation Models surged from 25 publications in 2024 to 144 in 2025, and Deep Learning models also increased substantially (910). For the first time in our annual review, publications using classical Machine Learning models declined (173). Image remained the predominant data type (53.9%), followed by text (38.2%), with a notable increase in audio (1.2%) coinciding with the adoption of multimodal models. Across foundation model publications, Imaging (110), Head and Neck (92), Surgery (64), Oncology (55), and Ophthalmology (49) were the leading specialties, while Administrative and Education categories remained high-volume contributors driven predominantly by LLM-based research. Conclusion: 2025 signals a meaningful maturation of the healthcare AI research field, with publication volumes nearly doubling, classical ML yielding to higher-capacity foundation models, and the field rapidly moving beyond traditional text-based LLM capabilities toward multimodal models. While Imaging continues to lead in research output, the growth of multimodal models across clinical specialties suggests the field is approaching an inflection point where AI systems can more closely mirror the complexity of real-world clinical practice.
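The maturity-screening step could look roughly like the following; the authors' fine-tuned BERT checkpoint is not public as far as we know, so a generic sentiment model stands in purely to show the pipeline shape:

```python
# Sketch only: the stand-in checkpoint below is NOT the authors' classifier;
# in the real pipeline the labels would be something like MATURE / NOT_MATURE.
from transformers import pipeline

clf = pipeline("text-classification",
               model="distilbert-base-uncased-finetuned-sst-2-english")  # stand-in

abstracts = [
    "We externally validated a deep learning model for sepsis prediction ...",
    "We propose a novel transformer architecture for ECG classification ...",
]
for text in abstracts:
    print(clf(text[:512]))  # e.g. [{'label': ..., 'score': ...}]
```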
Farquhar, H. L.
Natural language processing was applied to 3,586 Australian health practitioner tribunal decisions (1999-2026) to identify patterns in professional misconduct, outcomes, and temporal trends at a scale impractical through manual analysis. A text classification approach categorised 2,428 disciplinary decisions across seven misconduct types with acceptable accuracy for the major categories (per-class F1 0.47-0.82). Boundary violations were the most prevalent misconduct type (30.2%), followed by dishonesty/fraud (29.7%) and professional conduct breaches (28.0%). Reprimand was the most common outcome (53.0%), followed by cancellation (40.2%). Significant increasing trends were identified for boundary violations, dishonesty/fraud, professional conduct breaches, and communication failures. Boundary violations were associated with higher cancellation odds (OR = 1.36, p < 0.001). Opioid medications appeared in 67% of prescribing misconduct decisions. Significant jurisdictional variation in both misconduct types and outcomes was observed, with large effect sizes between major jurisdictions. The findings provide an empirical foundation for monitoring disciplinary trends under the National Law.
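A hedged sketch of this kind of decision classifier and its per-class F1 evaluation; the data layout and modelling choices (TF-IDF features with logistic regression) are our assumptions, not the paper's stated method:

```python
# Sketch only: "decisions.csv" is a hypothetical file with columns
# 'text' (decision full text) and 'misconduct_type' (one of seven labels).
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

df = pd.read_csv("decisions.csv")
X_tr, X_te, y_tr, y_te = train_test_split(
    df["text"], df["misconduct_type"],
    test_size=0.2, stratify=df["misconduct_type"], random_state=42)

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2), min_df=5),
                      LogisticRegression(max_iter=1000))
model.fit(X_tr, y_tr)
# classification_report prints per-class F1 (cf. the reported 0.47-0.82 range)
print(classification_report(y_te, model.predict(X_te)))
```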
Zafar, W.; Tavares, S.; Hu, Y.; Brubaker, L.; Green, J.; Mehta, S.; Grams, M. E.; Chang, A. R.
Background: Albuminuria is associated with increased risk of cardiovascular disease (CVD), heart failure, and progression of chronic kidney disease (CKD). Early detection of albuminuria through spot urine albumin-creatinine ratio (UACR) testing enables more accurate risk stratification and timely use of preventative therapies, yet testing rates remain unacceptably low in the hypertension population. Methods: We evaluated two EHR-embedded clinical decision support (CDS) strategies at Geisinger Health System to increase UACR testing in individuals with hypertension: an OurPractice Advisory (OPA) from Jan 2022 to Aug 2022, and a Health Maintenance Topic (HMT) in the Care Gaps section of Storyboard from Aug 2022 that continues to date. We evaluated UACR rates from 2020 to 2023 in Geisinger primary care and compared them to a control group of healthcare systems in the Optum Labs Data Warehouse (OLDW). Patients were excluded if they had UACR testing in the preceding 3 years, had diabetes or CKD, or were receiving palliative/hospice care. Results: We included 58,876 individuals in Geisinger (mean age 59.4 years, 49.6% female) and 1,427,754 in OLDW (61.0 years, 49% female). UACR testing in Geisinger (2.97% in 2020; 2.8% in 2021; 9.7% in 2022; 17.5% in 2023) showed a significant increase compared to the control health systems (2.08%, 2.26%, 3.35% and 3.40%, respectively). Results were consistent after adjusting for age, sex and race. Conclusion: The OPA increased UACR testing ~3-fold, whereas the HMT was associated with further improvements (~6-fold vs. baseline) among those with hypertension, suggesting an important role for CDS design in closing care gaps.
Alkhatib, S. A.; Jiwa, N.; Judd, D.; Luningham, J. M.; Sawyer-Morris, G.; Ulukaya, M.; Molfenter, T.; Taxman, F. S.; Walters, S. T.
Large language models (LLMs) are increasingly used for qualitative analysis in substance use research, yet their performance relative to human coders remains underexplored. This study compares ChatGPT-4.0 with human coders in identifying and describing the core innovation of NIH grants focused on reducing opioid overdose. A total of 118 NIH HEAL Initiative grant abstracts were independently coded by ChatGPT and humans to generate innovation descriptions, which were then evaluated by both human raters and ChatGPT for depth/detail and relevance/completeness using 5-point Likert scales. Identical instructions were used across all coding and evaluation stages. ChatGPT-generated descriptions were consistently rated higher than human-generated descriptions on both dimensions. Human evaluators rated ChatGPT outputs at an average of 4.47 for both depth/detail and relevance/completeness, compared to 3.33 and 3.24 for human outputs, respectively (F(1,176)=133.9, p<0.001). These findings suggest that LLMs, when carefully prompted, can enhance the efficiency and quality of qualitative research evaluation.
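A minimal sketch of the two-group rating comparison; the abstract does not specify the exact model behind F(1,176) = 133.9, so a plain one-way ANOVA on placeholder 1-5 ratings stands in here:

```python
# Sketch only: the rating arrays are placeholders, not study data.
import numpy as np
from scipy.stats import f_oneway

llm_ratings = np.array([5, 4, 5, 4, 5, 4])    # e.g. depth/detail scores for LLM output
human_ratings = np.array([3, 4, 3, 3, 4, 3])  # e.g. scores for human-coded output

F, p = f_oneway(llm_ratings, human_ratings)
print(f"F = {F:.1f}, p = {p:.4f}")
```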
Basu, S.; Baum, A.
Background: Clinicians in care management programs are often in low supply relative to patient demand, especially in US Medicaid programs, and must simultaneously address clinical risk, time efficiency, and patients' social needs. Many studies have shown that large language models may assist in tasks such as summarizing patient care and generating care plans; yet these studies also show that different objectives given to agents often conflict and produce problems for safety, efficiency and equity. We tested whether and to what degree game-theoretic approaches (a Nash bargaining framework) can produce care plans that advance multiple objectives across multiple language models, applying data from a real-world Medicaid cohort. Methods: We conducted two studies in a cohort of 5,148 activated Medicaid care management patients (69.9% female; 45.7% Black or African American; mean age 40.9 years) enrolled in Virginia and Washington. A retrospective evaluation applied five deterministic strategies to the full cohort to characterize multi-objective trade-offs. A pre-registered controlled paired experiment (N = 200) assigned each patient one Nash-orchestrated multi-agent plan and one compute-matched sequential self-critique plan, generated by locally hosted open-source models (DeepSeek-R1 8B; Llama 3.1 8B) with no patient data leaving local infrastructure. Pre-specified outcomes were Safety, Efficiency, Equity, and Composite (the mean of the three), each scored 0-1. Reporting follows CONSORT 2010 and STROBE. Results: Nash orchestration produced a Composite score of 0.755 (95% CI 0.751-0.760) versus 0.742 (95% CI 0.739-0.746) for the compute-matched baseline; the paired difference was 0.013 (95% CI 0.008-0.019; p = 6.20 x 10-). Safety and Efficiency paired differences were small-to-moderate in effect size (Cohen's d = 0.327 and 0.543, respectively) with confidence intervals excluding zero. The Equity paired difference was 0.000 (95% CI -0.015 to 0.014; p = 0.987). Conclusions: Role-specialized, Nash-orchestrated multi-agent language models produced measurably better Safety and Efficiency care plan quality than a compute-matched baseline under data-residency constraints. The null Equity result demonstrates that multi-objective role specialization does not automatically address equity; equity requires explicit design attention beyond composite weighting, with direct implications for responsible AI deployment in Medicaid care management. Author Summary: Care management programs for Medicaid patients need to address multiple goals at once: covering clinical risks, prioritizing the most impactful interventions, and recognizing the social barriers that affect whether patients can follow through on care plans. Prior research shows that automation tools powered by a single AI model tend to optimize for one of these goals at a time, sacrificing the others. We tested whether organizing several specialized AI agents, each focused on a different goal, and then combining their recommendations through a mathematical framework called Nash bargaining could produce better overall care plans for a real Medicaid population. We found that this multi-agent approach produced care plans that the AI judge rated as meaningfully safer and more efficient than plans generated by a single AI model using the same total amount of computation.
However, the multi-agent approach did not produce plans that were more equitable in addressing patients' social needs, suggesting that equity requires more direct attention as a design target rather than emerging from multi-objective combination alone. All AI inference was performed on locally hosted computers, with no patient information sent to outside services, reflecting the privacy requirements of real-world Medicaid care management programs.
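The Nash bargaining step can be illustrated concretely: among candidate plans scored on the three objectives, pick the one maximizing the product of gains over a disagreement point. The scores and disagreement utilities below are illustrative, not the paper's:

```python
# Sketch of Nash-bargaining plan selection over a discrete candidate set.
import numpy as np

# rows = candidate care plans, cols = (safety, efficiency, equity), each in [0, 1]
candidates = np.array([
    [0.80, 0.70, 0.60],
    [0.75, 0.78, 0.62],
    [0.70, 0.65, 0.75],
])
disagreement = np.array([0.50, 0.50, 0.50])  # assumed status-quo utilities

# Clip so plans at or below the disagreement point never win
gains = np.clip(candidates - disagreement, 1e-9, None)
nash_product = gains.prod(axis=1)
best = int(np.argmax(nash_product))
print("selected plan:", best, "| Nash product:", nash_product[best])
```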
Li, Y.; Zhou, H.; Blackley, S.; Plasek, J. M.; Lyu, Z.; Zhang, W.; You, J.; Centi, A.; Mishuris, R.; Yang, J.; Zhou, L.
Ambient intelligence-based systems are increasingly used for clinical documentation. To quantify linguistic differences associated with ambient documentation, we conducted a matched pre-post analysis of 6,026 outpatient clinical notes from Mass General Brigham following implementation of two ambient AI documentation systems (Nuance Dragon Ambient eXperience [DAX] and Abridge). Within-clinician comparisons focused on the History of Present Illness (HPI) and Assessment and Plan (A&P) sections and evaluated syntactic complexity, lexical ambiguity, linguistic variability, discourse coherence, and readability. Manual review of 50 paired notes was performed to validate findings from automated linguistic analyses. Our analyses indicate that the linguistic effects of ambient documentation are both vendor-dependent and section-specific. Across both vendors, ambient notes in HPI were longer and exhibited greater syntactic complexity (longer sentences and clauses, increased dependency distance), lower lexical ambiguity, lower language-model perplexity, and higher local and global discourse coherence. These findings indicate that ambient systems systematically restructure conversational input into more syntactically elaborated and linguistically predictable narratives, reflecting increased standardization relative to both general-domain and biomedical language models. In contrast, changes in A&P were smaller and more heterogeneous, consistent with its more structured/templated nature. Readability analyses further showed increased length and lexical complexity in ambient HPI, whereas A&P readability differences were minimal. Overall, our findings demonstrate that ambient documentation changes how clinical information is linguistically expressed and organized, with effects varying by note section, vendor, and provider role/specialty. Evaluation should therefore extend beyond efficiency to consider effects on communication, cognitive load, clinical inference, and downstream analytics.
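Two of the syntactic measures named here (mean sentence length and mean dependency distance) are straightforward to compute with spaCy; a sketch on placeholder HPI text:

```python
# Sketch only: requires `python -m spacy download en_core_web_sm`;
# the note text is a placeholder, not study data.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("The patient is a 62-year-old man who presents with three days of "
          "progressive exertional dyspnea and a nonproductive cough. "
          "He denies fever, chest pain, or recent travel.")

sent_lens = [len(sent) for sent in doc.sents]
# Dependency distance: token-to-head offset, excluding each sentence's root
dep_dists = [abs(tok.i - tok.head.i) for tok in doc if tok.head is not tok]
print("mean sentence length:", sum(sent_lens) / len(sent_lens))
print("mean dependency distance:", sum(dep_dists) / len(dep_dists))
```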
Singh, P.; Gonuguntla, S.; Chen, E.; Pradhan, A.; Becker, I.; Xu, N.; Steel, B.; Arkam, F.; Yakdan, S.; Benedict, B.; Naveed, H.; Wang, W.; Guo, W.; Wilt, Z.; Badhiwala, J.; Hafez, D.; Ogunlade, J.; Ray, W. Z.; Ghogawala, Z.; Kelleher, C.; Greenberg, J. K.
Objective: Evaluating and monitoring patients with cervical spondylotic myelopathy (CSM) remains a challenge due to limited tools for assessing objective neurological disability longitudinally and in the home environment. Given their prevalence and low cost, mobile health (mHealth), and specifically smartphone technologies offer a promising approach to fill this gap. This study explored stakeholder perspectives on the role of mHealth in CSM monitoring to inform development of a smartphone-based assessment application. Methods: We conducted semi-structured interviews with 15 patients with CSM and 14 healthcare providers (spine surgeons, physical therapists, and occupational therapists). Interviews explored current assessment practices, perceived limitations, and attitudes toward mHealth integration. Data were analyzed using thematic analysis. Results: Two major themes emerged from provider interviews: (1) diagnosing and monitoring CSM is challenging due to limitations in current tools, and (2) mHealth presents significant opportunities but requires thoughtful integration. Providers described current methods and technologies, clinical signs and symptoms, and challenges evaluating patients. Current tools were viewed as inadequate for precision medicine, with inter-rater variability and inability to capture real-world function. Within the second theme, providers identified ways mHealth could improve care, challenges for integration, and practical implementation considerations. Patients expressed strong interest in objective, longitudinal monitoring of gait, dexterity, and daily function. Conclusions: Stakeholders recognized substantial potential for mHealth to address unmet needs in CSM assessment. Successful implementation will require intuitive design, electronic medical record integration, and attention to accessibility. These findings provide a foundation for user-centered development of digital health tools in CSM care.
Syed, M. A.; Alnuaimi, A. S.; El Kaissi, D. B.; Syed, M. A.
Background: Artificial intelligence (AI) is increasingly being integrated into healthcare systems, with growing applications in clinical decision support, workflow optimization, and population health management. While substantial investments have been made in digital infrastructure, the successful adoption of AI in primary care depends critically on the readiness, awareness, and educational preparedness of healthcare professionals. Global health authorities emphasize the need for ethically grounded and workforce-focused approaches to AI integration; however, evidence on clinicians' readiness for AI, particularly in primary care settings and in the Middle East region, remains limited. Objectives: This study aims to assess the level of awareness, perceptions, attitudes, and educational needs related to AI among healthcare professionals working within Qatar's Primary Health Care Corporation (PHCC). In addition, it seeks to examine organizational factors influencing the integration of AI-focused education in primary care and to develop an AI readiness framework that can inform targeted training strategies and policy planning. Methods: This study will adopt a mixed-methods design guided by the Organizational Readiness for Change (ORC) framework, adapted for AI integration in primary care. The quantitative component will consist of an anonymous, census-style online survey distributed to all healthcare professionals across PHCC health centers and headquarters, assessing AI awareness, attitudes, training needs, and perceived infrastructure readiness. Composite AI awareness and attitude scores will be calculated, and regression analyses will be used to explore factors associated with AI readiness. The qualitative component will include semi-structured interviews and focus group discussions using maximum variation sampling to capture diverse professional perspectives. Qualitative data will be analyzed thematically, following COREQ and SRQR reporting standards. Quantitative and qualitative findings will be integrated to generate an AI readiness profile and an actionable education roadmap aligned with national digital health priorities. Discussion: This study will provide the first comprehensive assessment of AI readiness among primary care healthcare professionals in Qatar. By identifying knowledge gaps, training priorities, and organizational enablers and barriers, the findings are expected to inform the development of evidence-based AI education strategies within continuing professional development frameworks. The proposed AI readiness framework may also offer a transferable model for other health systems seeking to align workforce development with responsible AI implementation in primary care.
McLaughlin, L.; Walz, M. S.; Arries, C.
Large language models (LLMs) are increasingly transforming scientific workflows, yet their application to rigorous evidence synthesis remains underexplored. Through the execution of a single Python script, we present a fully automated pipeline leveraging the Claude API to generate systematic reviews from literature search through manuscript completion without human intervention. Our pipeline processes hundreds of papers through iterative API calls for inclusion evaluation, information extraction, and synthesis, achieving citation accuracy rates of 95.87% through controlled text-restriction strategies that mitigate hallucination. In a blinded evaluation, six board-certified hematopathologists rated AI-generated systematic reviews (mean quality score: 3.4-3.66/5) higher than a published human-authored review (2.6/5) on the same topic, yet failed to reliably distinguish AI from human authorship. Notably, the human-written review was most frequently misidentified as AI-generated, revealing systematic biases in expert perception of AI capabilities. While demonstrating superior prose quality and topic coherence, AI-generated reviews exhibited increased repetition and had to be restricted to only referencing a select number of papers per section, highlighting fundamental trade-offs between automation scale and information breadth. Our findings establish both the technical feasibility and critical limitations of LLM-driven knowledge synthesis, raising urgent questions about verification standards, disclosure practices, and potential misuse in academic publishing. As automated high-quality scientific writing becomes computationally trivial, we argue for establishing transparent integration frameworks and enhanced AI literacy among domain experts to preserve scientific integrity while harnessing efficiency gains.
Windisch, P.; Weyrich, J.; Dennstaedt, F.; Zwahlen, D. R.; Foerster, R.; Schroeder, C.
Purpose: Large language models (LLMs) are used for biomedical text processing, but individual decisions are often hard to audit. We evaluated whether enforcing a mechanically checkable "show your work" quote affects accuracy, stability, and verifiability for trial eligibility-scope classification from abstracts. Methods: We used 200 oncology randomized controlled trials (2005-2023) and provided models with only the title and abstract. Trials were labeled with whether they allowed for the inclusion of patients with localized and/or metastatic disease. Three flagship models (GPT-5.2, Gemini 3 Flash, Claude Opus 4.5) were queried with default settings in two independent conditions: label-only and label plus a verbatim supporting quote. Models could abstain if they deemed the abstract to not contain sufficient information. Each condition was repeated three times per abstract. Quotes were mechanically validated as exact substrings after whitespace normalization, and a separate judge step used an LLM to rate whether each quote supported the assigned label. Results: Evidence requirements modestly reduced coverage (GPT-5.2 86.2% to 84.3%, Gemini 98.3% to 92.8%, Claude 96.0% to 94.5%) by increasing abstentions and, for Gemini, invalid outputs. Conditional macro-F1 remained high but changed by model (slight gains for GPT-5.2 and Gemini, decrease for Claude). Labels were stable across repetitions (Fleiss kappa 0.829 to 0.969). Mechanically valid quotes occurred in 83.3% to 91.2% of runs, yet only 48.0% to 78.8% of evidence-bearing predictions were judged semantically supported. Restricting to supported predictions increased macro-F1 at the cost of lower coverage. Conclusion: Substring-verifiable quotes provide an automated audit trail and enable selective, higher-trust automation when applying LLMs to biomedical text processing. However, this approach introduces new failure modes and trades coverage for verifiability in a model-dependent way.
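The mechanical quote check described in the Methods reduces to an exact-substring test after whitespace normalization; a sketch (function names are ours, not the paper's code):

```python
# Sketch of the mechanical quote validation step.
import re

def normalize_ws(text: str) -> str:
    """Collapse all runs of whitespace to single spaces and trim."""
    return re.sub(r"\s+", " ", text).strip()

def quote_is_valid(quote: str, abstract: str) -> bool:
    """A quote is valid iff it is an exact substring after normalization."""
    return normalize_ws(quote) in normalize_ws(abstract)

abstract = "Eligible patients had histologically confirmed, metastatic disease ..."
print(quote_is_valid("histologically confirmed,\n  metastatic disease", abstract))  # True
print(quote_is_valid("locally advanced disease", abstract))                          # False
```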
Guo, Y.; Hu, D.; Yang, Z.; Chow, E.; Tam, S.; Perret, D.; Pandita, D.; Zheng, K.
Objective: The use of ambient AI documentation tools is rapidly growing in US hospitals and clinics. Such tools generate the first draft of clinical notes from scribed patient-provider conversations, which clinicians can then review and edit before signing into electronic health records (EHR). Understanding how and why clinicians make modifications to AI-generated drafts is critical to improving AI design and clinical efficiency, yet it has been under-studied. This study aims to address this gap. Materials and Methods: We conducted semi-structured interviews with 30 clinicians from the University of California, Irvine Health who used a commercial ambient AI tool in routine outpatient care. We invited them to describe how and why they edited AI drafts based on both their personal experience and review of real-world examples identified from our previous studies. Results: Modifications to AI drafts were primarily made to improve clinical accuracy and specialty-specific precision, reduce medico-legal and liability risk, and meet billing, coding, and documentation standards. Such editing was necessary due to transcription errors, speaker attribution mistakes, overconfident statements without evidence, missing key clinical details, and the AI's lack of information about the patient context. Conclusion and Discussion: Improving ambient AI documentation will require coordinated effort from vendors, institutions, and clinicians. Key targets include core model reliability (e.g., transcription accuracy), specialty- and encounter-level customization, clinician-level personalization, more effective EHR integration, and institutional support (e.g., training, governance, and standardized review guidance), complemented by clinicians' adaptive communication strategies that strengthen human-AI collaboration.
O'Reilly, E.; Kurakovas, T.
Background: Clinical risk prediction models are typically evaluated by discrimination (area under the receiver operating characteristic curve, AUC), with calibration receiving less attention. We developed a multi-timeframe diabetes prediction framework emphasizing calibration and used synthetic data validation to investigate whether good discrimination guarantees good calibration. Methods: We generated 500,000 synthetic patients using published epidemiological parameters from QDiabetes-2018, FINDRISC, and the Diabetes Prevention Program. The framework comprises a discrete-time survival ensemble with isotonic calibration, producing predictions at 6, 12, 24, and 36 months with bootstrap confidence intervals. We evaluated discrimination (AUC), bin-level calibration (expected calibration error, ECE), calibration-in-the-large (observed-to-expected ratio), and clinical utility (decision curve analysis). We compared performance against QDiabetes-2018 implemented on the same synthetic cohort. Results: Despite achieving excellent discrimination (AUC = 0.844, 95% CI: 0.840-0.848) and low bin-level calibration error (ECE = 0.006), the framework systematically overpredicted risk by 50%: mean predicted probability was 8.4% versus an observed rate of 5.6% (observed-to-expected ratio = 0.66, 95% CI: 0.65-0.67). This miscalibration occurred despite isotonic regression on a held-out calibration set. Overprediction was present in 9 of 10 risk deciles. Risk stratification remained valid (23.5-fold separation, 95% CI: 22.8-24.3, between highest and lowest tiers), confirming that discrimination was preserved. QDiabetes-2018 achieved comparable discrimination (AUC = 0.831) with better calibration (O:E = 0.89). Decision curve analysis showed net benefit across the threshold range 5-30%, though recalibration would improve clinical utility. Conclusions: Good discrimination does not guarantee good calibration. Our primary finding is negative: isotonic calibration failed to produce well-calibrated predictions even on synthetic data from a single generator. This has important implications for model deployment, where distribution shift is inevitable. We recommend that prediction model studies report calibration-in-the-large alongside bin-level metrics, as ECE alone can be misleading when risk distributions are skewed. Recalibration on deployment populations will likely be necessary for any prediction model, regardless of development-phase calibration performance.
Key Messages. What is already known: (1) Clinical prediction models require both discrimination (ranking patients correctly) and calibration (accurate probability estimates). (2) Isotonic regression is a recommended approach for post-hoc calibration. (3) Expected calibration error (ECE) is commonly reported as a summary calibration metric. What this study adds: (1) Demonstrates empirically that excellent discrimination (AUC = 0.844) can coexist with substantial miscalibration (50% overprediction). (2) Shows that low ECE can be misleading when most patients fall in low-risk deciles. (3) Provides evidence that isotonic calibration on held-out data may not generalize even within synthetic data from one generator. (4) Demonstrates a discrete-time survival architecture that reduces monotonicity violations to <0.1%. How this study might affect research, practice, or policy: (1) Prediction model studies should report calibration-in-the-large (O:E ratio) alongside ECE. (2) Developers should expect recalibration to be necessary when deploying to new populations. (3) Claims of calibrated prediction should be viewed skeptically without comprehensive calibration assessment.
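The paper's central contrast, a modest bin-level ECE coexisting with poor calibration-in-the-large, is easy to reproduce in miniature. The sketch below uses simulated risks that overpredict a rare outcome, then applies sklearn's isotonic recalibration on a held-out split, as the framework does; the data generator is ours, not the paper's:

```python
# Sketch: ECE vs O:E on simulated, systematically overpredicting risks.
import numpy as np
from sklearn.isotonic import IsotonicRegression

rng = np.random.default_rng(1)
p_pred = rng.beta(1, 10, 10_000)                   # skewed predicted risks
y = rng.binomial(1, np.clip(p_pred * 0.66, 0, 1))  # outcomes ~34% rarer than predicted

def ece(p, y, bins=10):
    """Mass-weighted mean absolute gap between predicted and observed rates per bin."""
    edges = np.linspace(0, 1, bins + 1)
    idx = np.digitize(p, edges[1:-1])
    total = 0.0
    for b in range(bins):
        m = idx == b
        if m.any():
            total += m.mean() * abs(p[m].mean() - y[m].mean())
    return total

print("ECE:", ece(p_pred, y))           # modest in absolute terms, because mass
print("O:E:", y.mean() / p_pred.mean()) # sits in low bins; O:E still far from 1

# Post-hoc isotonic recalibration on a held-out split; on i.i.d. data like this
# it recalibrates cleanly, whereas the paper shows it can fail to generalize.
iso = IsotonicRegression(out_of_bounds="clip").fit(p_pred[:5000], y[:5000])
p_cal = iso.predict(p_pred[5000:])
print("O:E after isotonic:", y[5000:].mean() / p_cal.mean())
```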
Sharma, K.; Sivadas, H.; Reddy, S.
Abstract: Emergency Department triage is a critical decision-making process in which clinicians must rapidly assess patient acuity under high cognitive load and time pressure. We present ED-Triage-Agent (ETA), a multi-agent AI framework designed to augment clinical decision-making in Emergency Severity Index (ESI) classification through human-AI collaboration. The system operates in two phases: (1) autonomous patient intake via a conversational agent that collects structured symptom histories and (2) collaborative acuity assessment in which specialized agents prioritize patients for vital sign collection and generate ESI classifications with explicit clinical reasoning. Unlike monolithic AI prediction systems, ETA mirrors clinical workflow by supporting decisions at each triage stage while preserving clinician autonomy. We describe the system architecture, agent design principles, and a preliminary evaluation methodology using the ESI Implementation Handbook case studies (60 standardized cases). This work proposes a model for deploying multi-agent AI systems in time-critical clinical environments where explainability and human oversight are essential. Code and the evaluation framework are available at https://github.com/Karthick47v2/ED-Triage-Agent.
Ekram, T. T.
Background: Large language models (LLMs) are increasingly deployed in medical contexts as patient-facing assistants, providing medication information, symptom triage, and health guidance. Understanding their robustness to adversarial inputs is critical for patient safety, as even a single safety failure can lead to adverse outcomes including severe harm or death. Objective: To systematically evaluate the safety guardrails of state-of-the-art LLMs through adversarial red-teaming specifically designed for medical contexts. Methods: We developed a comprehensive taxonomy of 8 adversarial attack categories targeting medical AI safety, encompassing 24 distinct sub-strategies. Using an LLM-based attack generator, we created 160 realistic adversarial prompts across categories including dangerous dosing, contraindication bypass, emergency misdirection, and multi-turn escalation. We tested multiple leading LLMs (Claude Sonnet 4.5, GPT-5.2, Gemini 2.5 Pro, Gemini 3 Flash) using both single-turn and multi-turn attack sequences. All models received identical, standard medical assistant system prompts. An automated evaluator (Claude Sonnet 4.5) pre-screened responses for harm potential (0-5 scale) and guardrail effectiveness, with physician review planned for high-risk responses (harm_level ≥ 3). Results: Of 160 adversarial prompts evaluated against Claude Sonnet 4.5, 11 (6.9%) elicited responses meeting our threshold for clinically significant harm (harm level ≥ 3 on a 0-5 scale). The model exhibited full refusal behavior in 86.2% of cases. Authority Impersonation was the dominant attack vector (45.0% success rate), with the "Educational Authority" sub-strategy (framing requests as medical student questions) achieving 83.3% success, the highest of any sub-strategy. Multi-turn escalation attacks achieved 0% success (0/20). Six of eight attack categories yielded no successful attacks. Physician review of the 11 flagged high-harm cases is in progress. Conclusions: Standard medical assistant system prompts provide strong baseline protection against most adversarial attacks but are substantially vulnerable to authority impersonation, particularly claims of educational context. The primary failure mode is behavioral mode-switching: when the model perceives a professional audience, it provides clinically accurate but inadequately safety-framed responses, rather than factually incorrect information. This suggests that guardrail improvements should target context-conditioned behavior rather than factual accuracy alone. Our open-source taxonomy and evaluation pipeline enable ongoing adversarial assessment as medical AI systems evolve. Impact: This work provides the first systematic taxonomy and evaluation framework for medical AI adversarial testing, enabling developers to identify and remediate safety gaps before deployment. Our open-source attack taxonomy and methodology can serve as a foundation for ongoing red-teaming efforts as medical AI systems continue to evolve.
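The headline rates reduce to simple per-category tallies over evaluator records; a sketch with mock records, using the abstract's harm_level ≥ 3 threshold for a successful attack (record fields are our assumed layout, not the paper's code):

```python
# Sketch of the per-category success-rate tally behind the reported figures.
from collections import defaultdict

results = [  # one mock record per adversarial prompt
    {"category": "authority_impersonation", "harm_level": 4, "refused": False},
    {"category": "dangerous_dosing", "harm_level": 0, "refused": True},
    {"category": "multi_turn_escalation", "harm_level": 1, "refused": True},
]

attempts, successes = defaultdict(int), defaultdict(int)
for r in results:
    attempts[r["category"]] += 1
    if r["harm_level"] >= 3:  # abstract's threshold for clinically significant harm
        successes[r["category"]] += 1

for cat in attempts:
    print(f"{cat}: {successes[cat]}/{attempts[cat]} "
          f"({100 * successes[cat] / attempts[cat]:.1f}% success)")
```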